fix: performance issue in interpretability notebooks #1238

Merged
memoryz merged 1 commit into microsoft:master from jasowang/notebook on Nov 3, 2021

Conversation

@memoryz memoryz commented Nov 3, 2021

In these notebooks, the background data should be broadcast. When the explain_instances dataframe (the observations to be explained) is mid-sized (roughly 50 to 100 rows), Spark picks an unexpected join plan that disrupts the parallelization of the Kernel SHAP sampler and creates a performance bottleneck. Broadcasting the background dataset makes Spark respect the partitioning of the explain_instances dataframe.

Both notebooks explain only 5 data points, so the performance bottleneck is not obvious. Changing 5 to 50 makes it obvious. Changing it further to 500, however, makes Spark use the intended join plan, and the bottleneck is not triggered.
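
For readers unfamiliar with broadcast hints, here is a minimal sketch of the idea, not the explainer's actual implementation. It assumes the sampler pairs each instance with the background rows through a cross join, which is my guess at the shape of the join; the helper names `with_broadcast_hint` and `pair_instances_with_background` are hypothetical. The only Spark calls used are `DataFrame.hint` and `DataFrame.crossJoin`.

```python
def with_broadcast_hint(background_df):
    """Return the background DataFrame tagged with Spark's broadcast join hint.

    This has the same intent as pyspark.sql.functions.broadcast(background_df):
    it asks the planner to ship the small side to every executor.
    """
    return background_df.hint("broadcast")


def pair_instances_with_background(instances_df, background_df):
    """Pair every instance with every background row (hypothetical helper).

    ASSUMPTION: a cross join stands in for whatever join the sampler performs.
    With the hint on the background side, the instances side should keep its
    own partitioning instead of being reshuffled for the join.
    """
    return instances_df.crossJoin(with_broadcast_hint(background_df))
```

With real DataFrames, calling `.explain()` on the result of `pair_instances_with_background(explain_instances, background)` is one way to check which join plan Spark chose.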

I considered forcing the broadcast inside the explainer, but that might have unexpected effects in other scenarios, so I'm hesitant to do so.

memoryz commented Nov 3, 2021

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

codecov-commenter commented Nov 3, 2021

Codecov Report

Merging #1238 (140641f) into master (81f5f80) will increase coverage by 0.20%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #1238      +/-   ##
==========================================
+ Coverage   83.38%   83.59%   +0.20%     
==========================================
  Files         277      264      -13     
  Lines       13094    12919     -175     
  Branches      634      634              
==========================================
- Hits        10918    10799     -119     
+ Misses       2176     2120      -56     
Impacted Files Coverage Δ
...ython/synapse/ml/vw/VowpalWabbitRegressionModel.py
...ain/python/synapse/ml/vw/VowpalWabbitClassifier.py
vw/src/main/python/synapse/ml/vw/__init__.py
...c/main/python/synapse/ml/nn/ConditionalBallTree.py
.../main/python/synapse/ml/recommendation/SARModel.py
...main/python/synapse/ml/vw/VowpalWabbitRegressor.py
...thon/synapse/ml/vw/VowpalWabbitContextualBandit.py
...synapse/ml/vw/VowpalWabbitContextualBanditModel.py
...e/ml/recommendation/RankingTrainValidationSplit.py
...n/synapse/ml/vw/VowpalWabbitClassificationModel.py
... and 3 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 81f5f80...140641f. Read the comment docs.

@memoryz memoryz marked this pull request as ready for review November 3, 2021 07:44
@memoryz memoryz merged commit 5733b85 into microsoft:master Nov 3, 2021
@memoryz memoryz deleted the jasowang/notebook branch November 3, 2021 07:45
4 participants